1.   INTRODUCTION

In their search for increased computer speed and throughput, Hobbs
and Theis [2] maintain that parallel processing is a solution for problems
with inherent parallelism. This property of parallelism allows various
operations to be performed concurrently. To demonstrate, assume that the
following two FORTRAN statements are given.

(1)  F = A + B + C + D

(2)  T  =  V  +  X  +  Y  +  Z 

Statement 2 does not rely on statement 1. Therefore, they may be
executed concurrently. On two parallel processors then, both statements would
be completed in the amount of time needed to process one statement. We can
also consider parallelism on the operator level. In statement 1, A + B can be
computed at the same time C + D is being calculated. Increasing the number of
processors would again decrease the amount of time needed to complete the
operations. In other words, parallel processing, in our example, would
increase computer speed and throughput.

Parallelism is inherent in FORTRAN programs according to David Kuck,
Yoichi Muraoka, and S. C. Chen [8]. Therefore, if we had parallel processors
with arithmetic units which could do any of four operations (add, subtract,
multiply, and divide) as well as initiate two other operations (fetch and
store), then we would have greater speed and throughput than a conventional
serial processor.

This paper will present some of the aspects behind the project
undertaken to implement Kuck's proposal [7] and describe a compile-time
operation scheduler for parallel processors. In addition, the experiments
which were run with the scheduler and their results will be discussed.


2.   ASPECTS 

2.1  Introduction  to  the  Scheduling  Problem 

Ever since computers came into being, there have been tradeoffs of
one kind or another. The most emphasized seems to be the ratio of cost to
speed of computation. Another which deserves some attention is the one
between compilation time (which includes any time spent in preparation prior
to actual program execution) and execution time. The considerations that
follow deal with a fixed number of parallel processors.

The tradeoff concerns the time when scheduling of operations is done.
If we execute operations as they are required, then operands must be fetched.
As a result, execution time would be longer than if we had some amount of
look-ahead so that the operands could have been available at the time the
operation was required. Other aspects of this problem are the minimization of
processor idle times and the optimization of processor run speeds. These two
can be accomplished by establishing for each of the parallel processors a
sequence of operations to be done. However, this ordering process would
require a longer compilation time. If we were given statement 1, on page 1,
along with two parallel processing elements (PEs) and the four operands
already fetched, we may assign the execution of A + B to PE 1 and the
execution of C + D to PE 2. When these operations are completed, PE 2 would
add the two results. This then describes an order of operation for each of
the processors.

Furthermore, Paul Kraska has developed an algorithm (algorithm a)
which calculates an order of operations in reduced time [5]. The PEs are
assumed to be capable of doing addition, subtraction, multiplication, and
division. Each processor, when done with its task, does not wait for other
processors to finish before it proceeds to another operation. Along with the
four arithmetic operations named above, the processors are capable of
initiating stores and fetches of data to and from memory, respectively.

The minimal time taken to process a set of operations is dependent
upon how many operations must be performed in sequence, as opposed to
operations being completed in parallel; and how long each of the operations
in the sequence takes. If we represent the sequences in this set of
operations as a tree, each operation would be represented by a node. The time
taken to complete an operation corresponds to the weight assigned to the
associated node.

Since  the  operations  must  be  done  in  some  order,  the  operations  tree 
is  a  partial  ordering  on  nodes.   For  an  illustration  of  the  ordering,  consider 
Figure  1.  The  operations  associated  with  nodes  2  and  3  are  to  be  completed 
before  the  operation  represented  by  node  1  is  begun.  Nodes  2  and  3  are  said  to 
be predecessors of node 1 in the tree. In this ordering nodes 4, 5, and 6 are
root nodes; and node 1 is a terminal node.


Figure  1.  An  Example  of  Partial  Ordering 
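The ordering just described can be written down as a predecessor table. The sketch below is an illustration only. The text states that nodes 2 and 3 precede node 1 and that nodes 4, 5, and 6 are root nodes; the edges from the root nodes to nodes 2 and 3 are assumed here, and the names `preds`, `root_nodes`, and `terminal_nodes` are hypothetical.

```python
# Predecessor table for a small operations tree.
# The edges 4->2, 5->2 and 6->3 are assumed for illustration only.
preds = {
    1: [2, 3],
    2: [4, 5],
    3: [6],
    4: [], 5: [], 6: [],
}

def root_nodes(preds):
    """Nodes with no predecessors (nothing must finish before them)."""
    return sorted(n for n, plist in preds.items() if not plist)

def terminal_nodes(preds):
    """Nodes that are not a predecessor of any other node."""
    used = {p for plist in preds.values() for p in plist}
    return sorted(n for n in preds if n not in used)

print(root_nodes(preds))      # [4, 5, 6]
print(terminal_nodes(preds))  # [1]
```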

The  "by-demand"  scheduling  is  discussed  in  a  paper  by  Larry  Swanson 
[12]. He is concerned with the data management problem. He presents a machine
configuration of a tree processor system associated with a number of routing
networks.  He  has  programs  which  simulate  various  combinations  of  the  log 
router,  the  Illiac  IV  router,  the  Semmelhaack  router,  the  Batcher  network,  and 
the  crossbar  switch.  He  discusses  the  amount  of  time  taken  to  transfer  data 
to  and  from  memory  and  between  PEs.   The  time  delays  influence  the  size  and 
speed  of  the  tree  machine  since  the  larger  the  machine,  the  longer  the  route 
data  must  travel. 

This paper will discuss the program written to implement Kraska's
algorithm a. It will also present some experiments which used the program
along with their results.

2.2  General  Description  of  Algorithm 

Starting  with  a  lower  bound  on  the  number  of  processors,  m,  Kraska's 
algorithm  a  provides  a  way  of  finding  the  least  upper  bound  on  the  number  of 
processors  needed  to  complete  the  operations  represented  by  an  operations  tree 
in  the  minimal  amount  of  time. 

As  operations  are  scheduled,  their  nodes  are  placed  into  lists  which 
correspond  to  the  processors.   The  lists  would,  at  the  end,  tell  us  which 
operations are to be executed by which PE and in what order. Since the weight
of each node corresponds to the amount of time required to complete the
associated operation, the lists then indicate when an operation has been
executed by the position of the node in a list and the weights of the nodes
below it. During scheduling then, each list has a height.

For each node, n_i, find the latest time it may be executed. This is
done by finding the largest sum of the weights of all the nodes between node
n_i and a root node, including the weight of the root node. The path length for
node n_i at any particular time is the sum of the minimal height for the lists
and the largest n_i-to-root weight sum. This path length indicates the time
the operations on the longest path through the operations tree will be
completed. The minimal time is just the longest path length through the tree.
This time is called the critical path length.
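The weight sums can be illustrated with a short sketch. It is not taken from the program described in this paper: the tree, the weights (fetch and store 1, add 2), and the names `preds`, `weight`, and `to_root` are assumptions, and the sum is taken here to include the weight of the node itself.

```python
from functools import lru_cache

# Hypothetical tree for F = A + B + C + D, computed as (A + B) + (C + D).
# preds[n] lists the nodes that must finish before n starts.
preds = {
    "store": ["add3"],
    "add3": ["add1", "add2"],
    "add1": ["A", "B"],
    "add2": ["C", "D"],
    "A": [], "B": [], "C": [], "D": [],
}
# Assumed weights: fetch and store take 1 time unit, add takes 2.
weight = {n: 1 for n in preds}
weight.update({"add1": 2, "add2": 2, "add3": 2})

@lru_cache(maxsize=None)
def to_root(node):
    """Largest weight sum from node up to a root node, node included."""
    above = max((to_root(p) for p in preds[node]), default=0)
    return weight[node] + above

# The critical path length is the largest such sum over all nodes.
critical_path_length = max(to_root(n) for n in preds)
print(to_root("add1"))        # 3
print(critical_path_length)   # 6
```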

To find a schedule which will be completed in minimal time, we
consider the terminal nodes in the operations tree. *Insert the node(s) with
the maximum path length into the lists. Then consider the tree with the
inserted node(s) removed. With the new tree we start all over again at *. If
we  can  continue  down  to  all  the  terminal  nodes  and  insert  them  into  the  lists 
without  getting  a  path  length  greater  than  the  critical  path  length  for  the 
operations  tree,  we  will  have  succeeded  in  our  scheduling  for  minimal  time; 
and  the  least  upper  bound  on  the  number  of  processors  will  have  been  reached. 
If,  however,  a  path  length  is  calculated  which  is  greater  than  the  critical 
path  length,  we  must  start  from  the  beginning  again,  but  this  time  with  an 
added  processor.  Hence,  another  list  is  available  for  scheduling,  for  a  total 
of  m  +  1  lists. 
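The outer loop of this search (try m processors, and add one whenever the minimal time is exceeded) can be sketched as follows. The inner scheduler is a swapped-in technique: a common forward, critical-path list-scheduling heuristic, not Kraska's algorithm, so the count it reports is only an estimate of the least upper bound. All names (`list_schedule`, `least_processors`, `preds`, `weight`) are hypothetical.

```python
def list_schedule(preds, weight, num_pe):
    """Greedy critical-path list scheduling; returns the finish time."""
    succs = {n: [] for n in preds}
    for n, plist in preds.items():
        for p in plist:
            succs[p].append(n)

    tail = {}   # longest weighted path from a node down to a terminal node
    def tail_len(n):
        if n not in tail:
            tail[n] = weight[n] + max((tail_len(s) for s in succs[n]), default=0)
        return tail[n]

    finish = {}                # node -> completion time
    pe_free = [0] * num_pe     # time at which each PE becomes free
    remaining = set(preds)
    while remaining:
        ready = [n for n in remaining if all(p in finish for p in preds[n])]
        # Pick the ready node with the longest remaining path.
        node = max(ready, key=lambda n: (tail_len(n), str(n)))
        pe = min(range(num_pe), key=lambda i: pe_free[i])
        start = max([pe_free[pe]] + [finish[p] for p in preds[node]])
        finish[node] = start + weight[node]
        pe_free[pe] = finish[node]
        remaining.remove(node)
    return max(finish.values())

def least_processors(preds, weight, m=1):
    """Start with m PEs; add one each time the minimal time is exceeded."""
    # With one PE per node nothing ever waits, so this is the critical path.
    critical = list_schedule(preds, weight, len(preds))
    while list_schedule(preds, weight, m) > critical:
        m += 1
    return m, critical

# Statement (1), F = A + B + C + D, with the four operands already fetched.
preds = {"A+B": [], "C+D": [], "sum": ["A+B", "C+D"]}
weight = {"A+B": 1, "C+D": 1, "sum": 1}
print(least_processors(preds, weight))   # (2, 2)
```

For this small example the sketch agrees with the introduction: two PEs finish the statement in two unit time steps, and one PE cannot.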

Algorithm  a  also  provides  a  way  of  scheduling  an  operation  tree  for 
a  fixed  number  of  processors  regardless  of  the  critical  path  length.   In  this 
case  it  is  necessary  to  provide  a  back-up  facility  so  that  when  an  assignment 
is  made  which  denies  assignment  to  a  node  awaiting  scheduling  and  which 
results  in  a  longer  path  length  than  the  critical  path  length,  we  are  able  to 
backtrack  to  the  point  the  node  was  denied  assignment,  insert  it  into  the 
lists  and  start  off  from  there.   This  method  provides  a  schedule  for  reduced 
time  given  a  fixed  number  of  processors. 

No matter how many processors are used, the critical path length
still represents the least amount of time necessary for the operation tree to
be processed. Thus, if we are to schedule on a greater number of PEs than the
least upper bound, the main difference in the results would be a greater
percentage of processor idle time than if we scheduled on the least upper
bound number of processors.

The weight associated with each node depends upon the time
needed to complete the corresponding operation. Therefore, if we assign the
weights  such  that  certain  operations  take  longer  than  others,  each  processor, 
in  effect,  would  seem  to  proceed  independent  of  the  other  PEs.  On  the  other 
hand,  if  we  assign  equal  weights  to  all  operations,  then  each  processor  would 
start  and  finish  an  operation  in  time  with  the  other  processors.  We  say  that 
the  nodes  are  unweighted  or  unit  weighted  since  the  time  spans  are  all  the 
same. 

2.3  Preparation  of  FORTRAN  Programs 

A  FORTRAN  program  has  many  other  types  of  statements  besides  a 
simple  assignment  statement.   Functions  and  subroutine  calls,  at  present,  are 
ignored.   Built-in  functions  are  replaced  by  a  number  of  adds  and  multiplies. 
Transcendental  functions  are  replaced  by  an  expression  consisting  of  a  sum  of 
products. Arithmetic functions are regarded as subscripted variables. DO
loops and IF statements are taken care of by back substitution, recursion, and
tree height reduction. These processes are described in Han's paper [1]. After
these operations have been applied to the FORTRAN program, it becomes a series
of assignment statement blocks which involve only the four arithmetic
operations named earlier along with store, fetch, and a transfer of control
operation.

Calls to subprograms and certain other FORTRAN statements like GO TO
signal the end of each assignment block [11]. Along the length of each block,
variables are back substituted recursively, and variables and subexpressions
are distributed over each arithmetic expression as specified by the
Multiplication Distribution Algorithm and the Division Distribution Algorithm
as described by Han [1] and Kraska [5]. These two algorithms specify
distribution only if the number of operation levels (the number of nodes in
the longest path from a terminal node to a root node) will decrease by the
distribution process. Thus, the time taken to complete execution of any
statement will be reduced, never lengthened, by the process.

One  aspect  which  should  be  considered  is  how  the  distribution  process 
affects  the  number  of  processing  elements.  Although  the  number  of  levels  may 
be  decreased  by  the  distribution  process,  the  number  of  processors  needed  for 
executing  in  the  reduced  time  never  decreases.   Since  distribution  introduces 
operations  into  the  levels,  it  is  often  true  that  more  processors  are  needed 
than  before  distribution.   Experiments  conducted  bear  out  this  theory. 

After  an  assignment  statement  block  has  been  back  substituted  and 
distributed,  the  corresponding  trees'  heights  are  reduced;  and  a  connectivity 
graph  is  made  in  the  following  way.   Define  all  "fetches"  to  be  root  nodes. 
Define  all  "stores"  to  be  terminal  nodes.  The  rest  of  the  matrix  is  constructed 
by  precedence.  As  an  example,  consider  the  statement 

(3)  x = b - a * c + d / e * f

The  graph  for  it  is  shown  as  Figure  2.1. 

Relax this graph in the following manner. First, place all terminal
nodes at the bottom level. Secondly, place at the next to the bottom level all
nodes which are predecessors to the terminal nodes only. Place at the third
from the bottom level, all predecessors of the nodes at the lower levels.
Continue in this fashion until all the root nodes have been placed into the
graph.


Figure  2.1.  A  Connection  Graph  for  (3) 


This  resulting  graph  is  relaxed.  Now  create  the  n  x  n  connectivity  matrix 
where  the  graph  has  n  nodes.   The  ij  entry  in  the  matrix  is  1  only  if  node  i 
precedes  node  j  in  the  graph.   Notice,  however,  that  the  relaxed  graph  will 
have  a  node  ordering  different  from  the  original  connection  graph  if  we  start 
numbering  at  the  root  level  and  proceed  toward  the  terminal  level.   Notice 
also  that  permutations  on  the  numbering  at  each  level  will  also  result  in  a 
relaxed graph as illustrated in Figures 2.2 and 2.3. Figure 2.4 shows the
connectivity matrix for (3) based on the relaxed graph in Figure 2.2.

Notice that the relaxed matrix is upper triangular. This is a
consequence of the relaxation process. The row(s) of zeroes at the bottom
indicate the terminal node(s), while the columns of zeroes at the left indicate
root node(s).


Figure 2.2.  A Relaxed Graph


Figure 2.3.  Another Relaxed Graph
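The relaxation and the resulting matrix can be sketched for statement (3). This is an illustration under stated assumptions, not the relaxation code of the program: the node names, the parse of the statement as (b - a*c) + ((d/e)*f), and the numbering within each level are all chosen here, so the matrix need not match the figures.

```python
# Direct successors for x = b - a*c + d/e*f, parsed as (b - a*c) + ((d/e)*f).
# Node names are hypothetical; fetches are named after their variables.
succ = {
    "a": ["mul1"], "c": ["mul1"], "b": ["sub"],
    "d": ["div"], "e": ["div"], "f": ["mul2"],
    "mul1": ["sub"], "div": ["mul2"],
    "sub": ["add"], "mul2": ["add"],
    "add": ["store"], "store": [],
}

def relax(succ):
    """Level 1 holds the terminal nodes; every other node sits one level
    above its highest successor."""
    level = {}
    def lv(node):
        if node not in level:
            level[node] = 1 + max((lv(s) for s in succ[node]), default=0)
        return level[node]
    for node in succ:
        lv(node)
    return level

level = relax(succ)
# Number the nodes from the root level down toward the terminal level.
order = sorted(succ, key=lambda node: (-level[node], node))
index = {node: i for i, node in enumerate(order)}
size = len(order)
matrix = [[0] * size for _ in range(size)]
for src, outs in succ.items():
    for dst in outs:
        matrix[index[src]][index[dst]] = 1   # src directly precedes dst

# No entry on or below the diagonal is set: the matrix is upper triangular.
upper = all(matrix[i][j] == 0 for i in range(size) for j in range(i + 1))
print(max(level.values()), upper)   # 5 True
```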

A weight vector, NW, is made from the weights of the nodes. A type
vector, NT, defines the operation involved at each node. Thus, NW_i and NT_i
tell how much time is required to complete the operation of type NT_i for node i
in the graph. The weight vector for the relaxed matrix is shown in Figure 2.5
with the assumption that fetches and stores take 1 time unit, adds and subtracts
take 2, multiplies take 3, and divides take 5 time units. If each fetch is
coded with a number greater than 11, symbolizing the variable fetched, add is
coded as 4, subtract as 5, multiply as 6, divide as 7, and store as 11, then
the type vector is shown in Figure 2.6.
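As a small illustration of the two vectors, the sketch below derives a weight vector from a type vector using the codes and times quoted above. The fetch codes 12 through 17 and the node order are invented for the example and need not agree with Figures 2.5 and 2.6; `weight_of` is a hypothetical helper.

```python
# Type codes from the text: add 4, subtract 5, multiply 6, divide 7, store 11.
# Any code above 11 stands for the fetch of some variable.
ADD, SUB, MUL, DIV, STORE = 4, 5, 6, 7, 11
TIME = {ADD: 2, SUB: 2, MUL: 3, DIV: 5, STORE: 1}

def weight_of(type_code):
    """Time units for one node, given its type code (fetches take 1)."""
    return 1 if type_code > STORE else TIME[type_code]

# Hypothetical type vector for x = b - a*c + d/e*f: six fetches, then
# a*c, d/e, b - a*c, (d/e)*f, the final add, and the store.
NT = [12, 13, 14, 15, 16, 17, MUL, DIV, SUB, MUL, ADD, STORE]
NW = [weight_of(t) for t in NT]
print(NW)   # [1, 1, 1, 1, 1, 1, 3, 5, 2, 3, 2, 1]
```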

The  weight  vector  is  then  used  to  compute  path  lengths  and  lists' 
heights.   The  type  vector  is  a  convenience  supplied  to  provide  some  statistics 
on  the  operations  at  each  level.  A  table  of  occurrence  is  outputted  to  show 
the  frequency  of  each  operation  at  the  different  levels. 


3.   A  DESCRIPTION  OF  WSCHED

3.1  A  General  Description  of  WSCHED 

WSCHED is a PL/I procedure which implements Kraska's algorithm a,
described earlier in this paper. It receives as parameters a connectivity
matrix along with a weight vector, a type vector, and the dimension N. First,
the N x N BIT(1) connectivity matrix is relaxed. (See pages 10-11 and 16-17
for discussions of the relaxation procedure.) Relaxation produces another
result: the number of operation levels in the tree. The levels give us a
precedence relation. If node i is in level m and node j is in level n, with
m > n, and there is a path between i and j, then i is a predecessor of j. The
levels are numbered from bottom to top. Therefore, the level of terminal nodes
is level 1.

Second, since the levels of the tree represent a precedence relation
and the matrix represents a connectivity relation, we can compute the critical
path length and the paths for any node to a root node. From these, we calculate
the lower bound on the number of processors needed to complete the operations
represented  by  the  matrix  and  the  two  vectors.   This  lower  bound  is  found  as 
follows.   Calculate  at  each  level  the  sum  of  all  the  node  weights  for  nodes  on 
and  below  the  level.  Divide  the  sum  by  the  maximum  path  length  up  to  the  level. 
The  smallest  integer  not  less  than  the  largest  ratio  is  the  lower  bound  on  the 
number  of  processors. 
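One possible reading of this lower-bound rule is sketched below. It is an interpretation, not the WSCHED code: "the maximum path length up to the level" is assumed to mean the longest weighted path from a terminal node up to any node on or below that level, and the names `processor_lower_bound`, `succ`, and `weight` are hypothetical.

```python
import math

def processor_lower_bound(succ, weight):
    """Largest ratio of (work on and below a level) to (longest path up to
    that level), rounded up to an integer."""
    level, path = {}, {}
    def visit(node):
        if node not in level:
            for s in succ[node]:
                visit(s)
            level[node] = 1 + max((level[s] for s in succ[node]), default=0)
            path[node] = weight[node] + max((path[s] for s in succ[node]), default=0)
    for node in succ:
        visit(node)

    best = 0.0
    for k in range(1, max(level.values()) + 1):
        below = [n for n in succ if level[n] <= k]
        work = sum(weight[n] for n in below)
        longest = max(path[n] for n in below)
        best = max(best, work / longest)
    return math.ceil(best)

# F = A + B + C + D with operands already fetched and unit weights:
# level 1 holds the final add, level 2 holds A + B and C + D.
succ = {"A+B": ["sum"], "C+D": ["sum"], "sum": []}
weight = {"A+B": 1, "C+D": 1, "sum": 1}
print(processor_lower_bound(succ, weight))   # 2
```

Here the ratios are 1/1 at level 1 and 3/2 at level 2, so the bound is 2.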

Since  algorithm  a  contains  an  option,  the  signal  indicator  SCHEDPE  is 
provided  by  the  user  to  determine  whether  WSCHED  is  to  find  the  least  upper 
bound  on  the  number  of  processors  to  complete  operation  in  minimal  time  or 
WSCHED  is  to  simply  find  a  schedule  for  the  number  of  processors  indicated. 

If SCHEDPE(1) = 0, we want to find the least upper bound. Starting
with the lower bound obtained above, procedure JCOMP is called which determines
the  scheduling  of  the  processors.   Should  JCOMP  fail  to  find  a  schedule  which 
would  complete  execution  in  minimal  time,  it  returns  a  KODE  =  1  indicating 
failure.  Upon  seeing  this  code,  WSCHED  will  add  another  processor  to  the 
group  and  recall  JCOMP.  Eventually  JCOMP  will  return  a  KODE  =  0,  indicating 
that  an  ordering  of  operations  exists  which  will  have  the  operations  completed 
in  the  minimal  time.  At  this  point,  the  number  of  processors  available  to 
JCOMP  is  the  least  upper  bound  on  the  number  of  processors  needed  to  execute 
the  operations  represented  by  the  matrix  and  the  two  vectors  in  minimal  time. 
We  also  know  the  order  in  which  the  operations  are  done,  which  operation  is 
executed  by  which  processor,  and  what  percentage  of  the  time  each  of  the 
processors  is  idle. 

If SCHEDPE(1) ≠ 0, it contains the number of scheduling trials to be
made for certain numbers of processing elements. SCHEDPE(2) through
SCHEDPE(number of trials plus one) determine the number of processors to be
used during trials 1 through SCHEDPE(1), respectively. During each trial,
JCOMP will
schedule  the  nodes  on  the  specified  number  of  arithmetic  processors,  regardless 
of  the  critical  path  length.  The  resulting  order  is  a  schedule  which  will 
complete  the  operations  as  quickly  as  possible. 
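The two uses of SCHEDPE can be summarized in a few lines. This is a sketch of the convention as described in the text, written with a 0-based Python list in place of the 1-based PL/I array; the function name `plan_trials` and its return values are hypothetical.

```python
def plan_trials(schedpe, lower_bound):
    """Return ("lub", starting PE count) or ("fixed", list of PE counts)."""
    trials = schedpe[0]            # SCHEDPE(1) in the 1-based source
    if trials == 0:
        # Search for the least upper bound, starting at the lower bound.
        return ("lub", lower_bound)
    # SCHEDPE(2) .. SCHEDPE(trials + 1) give the PE count for each trial.
    return ("fixed", list(schedpe[1:trials + 1]))

print(plan_trials([0], 2))            # ('lub', 2)
print(plan_trials([3, 2, 4, 8], 2))   # ('fixed', [2, 4, 8])
```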

It should be noted that for any number of processors less than the
least upper bound (LUB) calculated in WSCHED, the amount of time needed for
completion necessarily increases. There is no use having the number of
processors greater than the LUB since the critical path length cannot
decrease. Therefore, having extra processors only results in a higher
percentage of idle time for the processors.



3.2  The  Relaxation  Process

The relaxation of the connectivity matrix requires row and column
interchanges since we may want to move a node in level i down to level j,
i > j. We usually have more than one node moving through levels in a
relaxation. Therefore, it would be too time consuming to physically
interchange the rows and columns. As a result, pointers are used to effect the
exchange; and it is not until the very end that the physical change takes
place.

Since the position of a node after relaxation does not necessarily
correspond to its position prior to the exchanges, another set of pointers
keeps the corresponding old position for each node. Hence, suppose that node 6,
which corresponded to row 6 and column 6 in the original connectivity matrix,
became node 10, corresponding to row 10 and column 10 in the relaxed matrix.
If JCOMP then finds that node 10 is in list 2, we actually have PE 2 executing
the original node 6. An example of tree relaxation was given earlier. An
example of a graph relaxation may be seen in Figures 3.1, 3.2, and 3.3.

Notice that Figure 3.3 would give rise to a relaxed matrix where
node i corresponds to row i and column i, while Figure 3.2 would not. The
node numbering in Figure 3.2 shows what happened to each original node. This
would be the intermediate result of the phase when pointers are utilized to
effect row and column exchanges. The renumbering process after relaxation
would yield Figure 3.3.
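The pointer technique can be illustrated as follows: exchanges are recorded in an index vector, and the matrix is rebuilt once at the end. This is a sketch of the idea only; the function `apply_permutation`, the vector `new_to_old`, and the three-node example are assumptions, not the data structures of WSCHED.

```python
def apply_permutation(matrix, new_to_old):
    """Build the reordered matrix in a single pass.
    new_to_old[i] is the original row/column that ends up at position i."""
    n = len(matrix)
    return [[matrix[new_to_old[i]][new_to_old[j]] for j in range(n)]
            for i in range(n)]

# Original three-node chain 2 -> 0 -> 1 (entry [i][j] is 1 when node i
# directly precedes node j).
original = [[0, 1, 0],
            [0, 0, 0],
            [1, 0, 0]]
new_to_old = [2, 0, 1]    # root node first, terminal node last
relaxed = apply_permutation(original, new_to_old)
print(relaxed)            # [[0, 1, 0], [0, 0, 1], [0, 0, 0]]

# The same vector translates a scheduled (relaxed) node back to its
# original number when the lists are printed.
print(new_to_old[0])      # 2
```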

3.3  General  Description  of  JCOMP

Starting with the terminal nodes, JCOMP calculates their longest
paths through the tree. Assign processors to the terminal node(s) which are
on the longest paths. Remove the scheduled nodes from the tree and start over
with the new tree. The procedure followed is described in Kraska's paper [5].
The processor(s) with the minimum amount of activity will be assigned first.


Figure 3.1.  The Original Graph


Figure 3.2.  A Relaxed Graph With Four Levels of Nodes


Figure 3.3.  A Relaxed Graph Representing Relaxed Matrix

When  JCOMP  has  finished  establishing  the  order  of  node  processing, 
it  prints  out  the  nodes  according  to  the  original  node  numbering  system  rather 
than  according  to  the  relaxed  numbering.   Thus,  the  initial  input  to  WSCHED 
has  the  same  node  numbering  as  the  final  output  of  WSCHED,  the  scheduled  lists. 

JCOMP  is  an  internal  procedure  in  WSCHED.   It  receives  the  number  of 
processors  for  which  it  should  schedule,  the  number  of  levels  in  the  matrix, 
and  the  number  of  nodes  in  the  operations  graph.   It  returns  a  code  of  1  to 
signal  an  unsuccessful  attempt  and  a  code  of  0  for  success  in  scheduling  the 
nodes  on  the  number  of  processors  provided.   JCOMP  includes  a  switch  which 
determines  whether  the  LUB  is  to  be  found  or  a  straight  scheduling  is  to  be 
done. 

The input for the scheduler involves only assignment statements,
and each of these begins with a series of variable fetches, one of the fastest
operations. Therefore, if the stage of having only root nodes (fetches) left
unscheduled is reached, then they may be scheduled in any order as long as the
scheduling does not increase the critical path length of the LUB calculation.
If other operations with weights greater than that of fetches were to be
included at the root level in LUB calculation, then Johnson's algorithm [3]
would have to be implemented to perform the final phase of scheduling.
Johnson's  algorithm  considers  the  amount  of  time  left  to  each  processor  before 
minimal  time  is  exceeded.   Start  with  the  PE  with  the  least  amount  of  time  left. 
Find  an  unscheduled  operation  or  a  set  of  unscheduled  operations  which  occupy 
the  PE  the  longest  without  exceeding  its  time  left.   Then  proceed  to  each  of 
the  other  processors  in  order  of  time  left,  so  the  PE  with  the  most  time  left 
is  scheduled  only  if  there  are  still  unscheduled  operations  left  to  be  done. 
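The final phase as worded above can be sketched with a brute-force search over sets of operations. This follows only the description in this section and is not claimed to be Johnson's algorithm as published; the search is exponential and suitable only for a handful of operations. The names `fill_final_phase`, `time_left`, and `ops` are hypothetical.

```python
from itertools import combinations

def fill_final_phase(time_left, ops):
    """time_left: {PE: time units left}; ops: {operation: weight}.
    Returns ({PE: [operations]}, operations that did not fit)."""
    unscheduled = dict(ops)
    plan = {}
    # Visit the PEs in order of time left, least first.
    for pe in sorted(time_left, key=lambda p: time_left[p]):
        best, best_total = (), 0
        names = sorted(unscheduled)
        for r in range(1, len(names) + 1):
            for combo in combinations(names, r):
                total = sum(unscheduled[n] for n in combo)
                # Keep the set that occupies the PE longest without overrun.
                if best_total < total <= time_left[pe]:
                    best, best_total = combo, total
        plan[pe] = list(best)
        for n in best:
            del unscheduled[n]
    return plan, sorted(unscheduled)

plan, leftover = fill_final_phase({1: 3, 2: 5}, {"p": 2, "q": 3, "r": 3})
print(plan)       # {1: ['q'], 2: ['p', 'r']}
print(leftover)   # []
```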
4.   EXPERIMENTS  AND  RESULTS

4.1  Background  Discussion  of  the  Experiments

Let  us  discuss  two  differences  between  weighted  and  unweighted  nodes. 
We  want  to  see  how  weights  affect  the  scheduling  of  operations  and  how  the  two 
types  of  nodes  are  affected  by  the  distribution  process. 

Consider  the  expressions: 

(4)  C / A + D + B * (A + E)

(5)  C / A + D + B * A + B * E

For the unit weighted case, statement 4 can be calculated in 3 time steps by
2 PEs, assuming that the operands are already available. For the weighted
case, with the weights as defined in section 2.3, statement 4 would be done in
9 time units by 2 PEs. However, the scheduling process is different from the
one with unit weights (see Figures 4.1 and 4.2). Since statement 5 would not
be executed faster than statement 4 and it would take an extra PE in both the
weighted and the unweighted case (see Figures 4.3 and 4.4), distribution would
not be done. Note that the operations performed are in different sequences.
In the weighted case one of the PEs only has to do the divide operation while
the other PE is performing two adds and a multiply.
Let  us  look  at  another  set  of  statements: 

(6)  A  +  D  *  (B  +  C  +  E) 

(7)  A  +  D*B  +  D*C+D*E 

As we see in Figures 4.5 and 4.6, statement 6 requires 1 PE in both cases.
The weighted case takes 9 time units; the unweighted case takes 4. Notice that
the scheduling is the same for both the weighted case and the unweighted case.
Once we distribute D, however, as shown in Figures 4.7 and 4.8, the scheduling
changes and the time shortens with 3 PEs being used in both cases. Therefore,
we would proceed with the distribution process in both cases.
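The times quoted for statements 4 and 6 can be checked with a short calculation of the longest path through each expression tree. The sketch assumes the operands are already fetched and that enough PEs are available, and the grouping of each expression into a tree (in particular for statement 7) is chosen here for illustration. The names `finish_time`, `UNIT`, and `WEIGHTED` are hypothetical.

```python
UNIT = {"+": 1, "-": 1, "*": 1, "/": 1}
WEIGHTED = {"+": 2, "-": 2, "*": 3, "/": 5}   # weights from section 2.3

def finish_time(tree, cost):
    """Time at which the value of tree is ready. A tree is either a
    variable name or a tuple (operator, left subtree, right subtree)."""
    if isinstance(tree, str):
        return 0                      # operand is already available
    op, left, right = tree
    return cost[op] + max(finish_time(left, cost), finish_time(right, cost))

# Statement 4: C / A + D + B * (A + E)
s4 = ("+", ("+", ("/", "C", "A"), "D"), ("*", "B", ("+", "A", "E")))
# Statement 6: A + D * (B + C + E)
s6 = ("+", "A", ("*", "D", ("+", ("+", "B", "C"), "E")))
# Statement 7: A + D*B + D*C + D*E, grouped as (A + D*B) + (D*C + D*E)
s7 = ("+", ("+", "A", ("*", "D", "B")),
           ("+", ("*", "D", "C"), ("*", "D", "E")))

print(finish_time(s4, UNIT), finish_time(s4, WEIGHTED))   # 3 9
print(finish_time(s6, UNIT), finish_time(s6, WEIGHTED))   # 4 9
print(finish_time(s7, UNIT), finish_time(s7, WEIGHTED))   # 3 7
```

Under these assumptions the distributed statement 7 is shorter than statement 6 in both cases, which is consistent with the decision to distribute.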
The  last  set  of  expressions  to  be  considered  are: 


(8)  ((A  +  B)  *  C  *  D)  *  E 

(9)  (A  +  B)  *  C  *  D  *  E 


Figure 4.1.  Unweighted Case

Figure 4.2.  Weighted Case  (timing: 7 + 2 = 9)

Figure 4.3.  Unweighted Case Distributed  (C / A + D + B * A + B * E)

Figure 4.4.  Weighted Case Distributed  (C / A + D + B * A + B * E)

Figure 4.5.  Unweighted Case  (A + D * (B + C + E))

Figure 4.6.  Weighted Case  (A + D * (B + C + E))

Figure 4.7.  Unweighted Case Distributed  (A + D * B + D * C + D * E)

Figure 4.8.  Weighted Case Distributed  (A + D * B + D * C + D * E)

Figure 4.9.  Unweighted Case  (((A + B) * C * D) * E)

Figure 4.10.  Weighted Case  (((A + B) * C * D) * E)

Figure 4.11.  Unweighted Case Distributed  ((A + B) * C * D * E)

Figure 4.12.  Weighted Case Distributed  ((A + B) * C * D * E)

This  time  distribution  will  be  done  for  the  weighted  case,  but  it 
will  not  be  done  for  the  unweighted  case.   This  discussion  bears  out  our 
previous  observation  that  the  distributed  case  tends  to  require  more  PEs  than 
the  undistributed  case,  but  it  may  have  shorter  execution  time. 

If  subexpressions  are  evaluated  only  once  and  the  results  are  used 
whenever  the  corresponding  subexpressions  appear,  the  number  of  processors 
needed  will  usually  decrease;  but  the  number  of  levels  in  the  operations  tree 
will remain constant. The reason for this is apparent. If a set of operations
has to be done in several places, there will be PEs at each level working on
the same operands producing the same results that some other PE at that level
is also producing. Therefore, a greater number of PEs would be required than
if we were to route the previous results to wherever that subexpression is
needed as an operand. Since we have already found that several processors
executing the same set of operations take the same amount of time as one
processor executing that set of instructions, the time needed to compute
remains constant.

To verify these ideas, two sets of experiments were run. The first
set compares the effects of distribution on weighted and unweighted nodes for
the same statements. The statements used are similar to our previous examples.
The second set applies WSCHED to two versions of a polynomial of degree 30.
The connection graph of the first version requires each instance of common
subexpressions to be evaluated separately. Therefore, this graph is a tree.
For the second version communication between processors is allowed, i.e.,
common subexpressions are computed once, and the results are routed to
wherever they are needed. For example, the value of x^5 is calculated only
once; and a higher power of x may be calculated by using lower powers of x
already computed.

4.2  Discussion  of  the  Distribution  Results

Six FORTRAN statements (Figure 4.13) were run through WSCHED both
with weighted operators and unweighted operators. The results are shown in
Table 1 and Table 2. These statements were chosen to show various
distributions as well as how the distribution process can decrease the
critical path length and how it affects the number of processing elements.
The entries in the Σ (% idle time) columns have been rounded to the nearest
percent to avoid clutter.

Statements A, B, E, and F were put through the distribution process
in both the weighted case and the unweighted case. For these four statements,
the number of levels decreased (the tree height was reduced); and the amount
of time required to completely process each statement was cut. However, the
number of processing elements increased. These results are as expected from
our previous discussion.

Set #   Statement

  A     BA1 = A * (B * C * D + E)
  B     BA2 = A * (B * C + D) + E
  C     BE = (A + B * C * D) * (E + F)
  D     BE = ((A + B) * C * D) * E
  E     BE = (A * (B * C)) / D * (E * F * G)
  F     BE = (((A * (B * C)) * (D * E)) / F) * G

Figure 4.13.  The 6 FORTRAN Statements

Consider  statement  C.   It  should  go  through  distribution  in  the 
weighted  case  since  clock  time  decreases  while  the  tree  height  remains  constant 
and  the  number  of  processors  required  increases.  However,  in  the  unweighted 
case  no  distribution  would  be  done.  Not  only  does  the  number  of  levels  remain 
constant  (and,  hence,  since  this  is  the  unit  weighted  case,  the  clock  time 
remains  constant),  but  the  number  of  processors  increases  causing  a  greater 
percentage  of  idle  time  on  the  processors  used. 

In statement D we have the case that distribution is no help at all
in the unweighted case as far as number of levels and time are concerned.
However, with distribution we see a decrease in the number of PEs, thereby
decreasing idle time. So distribution should be done if we are interested in
conserving machine power. In the weighted case not only is the number of
processors decreased, but time is also shortened, so distribution should be
done. This is the only statement in the group of six in which distribution
causes the amount of idle time to decrease.

Set #   No. of levels   Distribute?   No. of PEs   Time   Σ % idle time

A            6              no             2         13         69
A            5              yes            4         10        190
B            6              no             2         12         67
B            5              yes            3         10        100
C            6              no             2         13         46
C            6              yes            3         12         75
D            5              no             4         11        145
D            5              yes            3         10        130
E            6              no             3         16        125
E            5              yes            5         13        246
F            7              no             2         19         52
F            5              yes            5         13        246

(Σ % idle time is summed over the number of processing elements used)

Table 1. For Weighted Nodes

Set #   No. of levels   Distribute?   No. of PEs   Time   Σ % idle time

A            6              no             2          6         33
A            5              yes            4          5        160
B            6              no             2          6         33
B            5              yes            4          5        180
C            6              no             3          6        100
C            6              yes            4          6        150
D            5              no             4          5        200
D            5              yes            3          5        100
E            6              no             4          6        167
E            5              yes            6          5        320
F            7              no             3          7        100
F            5              yes            6          5        320

Table 2. For Unweighted Nodes

In this set of experiments, two-thirds of the cases yielded higher
amounts of idle time for the unweighted case than for the weighted case. This
is due to the fact that the number of processors needed by the unweighted nodes
is not less than the number of processors needed by the weighted case. The
reason is simple to see. If we restrict processors to waiting for fellow PEs,
no matter how long any one of them takes, we waste processing time. In Figure
4.10, the processor executing A + B had to wait for the results from the
processor calculating C * D since multiplies take longer than adds in our
example. This creates a time slot during which the former processor is
unoccupied. There was also a 6-unit time hole while the E was waiting to be
multiplied, for a total of 7 time units unoccupied. After distribution in
Figure 4.12, there are still two holes; however, they total up to only 4 time
units unoccupied. We notice that the A + B processor could switch right into
multiplying (A + B) by C without waiting for D * E to be computed. A
discussion on time holes may be found in Kraska's paper [5].

As expected, the number of PEs required by the unweighted case is
greater than or equal to the number of processors needed by the weighted case.
The argument is presented above: in the weighted case a processor is not tied
up waiting for another PE to finish. It is allowed to proceed to another
operation. If this operation is on the same level as the one the PE just
finished, it is then taking the place of another PE, so that we could, for
example, do with one less processor for the weighted case than for the
unweighted case.
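
The two comparisons drawn above can be checked mechanically. The sketch below is only an illustration: the figures are transcribed by hand from Tables 1 and 2 as (number of PEs, time, summed % idle) for each statement with and without distribution, and the dictionary names are hypothetical.

```python
# (statement, distributed?) -> (PEs, time, summed % idle), transcribed from the tables
weighted = {
    ("A", "no"): (2, 13, 69),  ("A", "yes"): (4, 10, 190),
    ("B", "no"): (2, 12, 67),  ("B", "yes"): (3, 10, 100),
    ("C", "no"): (2, 13, 46),  ("C", "yes"): (3, 12, 75),
    ("D", "no"): (4, 11, 145), ("D", "yes"): (3, 10, 130),
    ("E", "no"): (3, 16, 125), ("E", "yes"): (5, 13, 246),
    ("F", "no"): (2, 19, 52),  ("F", "yes"): (5, 13, 246),
}
unweighted = {
    ("A", "no"): (2, 6, 33),   ("A", "yes"): (4, 5, 160),
    ("B", "no"): (2, 6, 33),   ("B", "yes"): (4, 5, 180),
    ("C", "no"): (3, 6, 100),  ("C", "yes"): (4, 6, 150),
    ("D", "no"): (4, 5, 200),  ("D", "yes"): (3, 5, 100),
    ("E", "no"): (4, 6, 167),  ("E", "yes"): (6, 5, 320),
    ("F", "no"): (3, 7, 100),  ("F", "yes"): (6, 5, 320),
}

# Cases in which the unweighted schedule shows more idle time than the weighted one.
higher_idle = sum(unweighted[k][2] > weighted[k][2] for k in weighted)
# The unweighted case never needs fewer PEs than the weighted case.
pes_ok = all(unweighted[k][0] >= weighted[k][0] for k in weighted)

print(higher_idle, len(weighted))  # 8 12
print(pes_ok)                      # True
```

Eight of the twelve cases is the two-thirds quoted in the text.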
4.3 Background for the Second Set of Experiments

Let us, for simplicity's sake, first consider a polynomial of
degree 7 with unweighted nodes before tackling the polynomial of degree 30.
Muraoka's folding method [10] factors out powers of x which are Fibonacci
numbers, and the subexpression left after factoring is in the form of the
subexpression to the left of it. For example, consider

(10)  $P_7(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4 + a_5 x^5 + a_6 x^6 + a_7 x^7$

which, using Muraoka's method, is folded into

(11)  $P_7(x) = a_0 + a_1 x + a_2 x^2 + (a_3 + a_4 x) x^3 + (a_5 + a_6 x + a_7 x^2) x^5$

(see Figure 4.14 for the operation tree). In this example, $a_5 + a_6 x + a_7 x^2$
is in the same form as $a_0 + a_1 x + a_2 x^2$; and the former subexpression
takes no longer to calculate than the latter. The highest powers of x in the
equation can be calculated by multiplying two previously calculated powers,
e.g., $x^5 = x^2 * x^3$, the property of Fibonacci numbers.


Figure 4.14. Graph for $P_7$ Using Common Results
With data transmission between processors, as shown in Figure 4.14,
Kraska [6] proposed that no more than $n w_a + n w_m + 2.75 \ln(n) + 1$
arithmetic operations are needed to compute $P_n$, a polynomial of degree n,
where $w_a$ is the weight of an add node and $w_m$ is the weight of a multiply
node. Thus, for $P_7$, the upper bound is computed as being 25 arithmetic
operations, 9 fetches, and 1 store. In actuality, $P_7$ requires 17 arithmetic
operations, 9 fetches, and 1 store.
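
The count of 17 arithmetic operations can be verified by evaluating the folded form (11) while sharing the powers of x. The following is a minimal sketch, not the scheduler's own representation; the `OpCounter` class and the function `folded_p7` are hypothetical helpers introduced only to count adds and multiplies.

```python
class OpCounter:
    """Performs adds and multiplies while counting them."""

    def __init__(self):
        self.adds = 0
        self.muls = 0

    def add(self, u, v):
        self.adds += 1
        return u + v

    def mul(self, u, v):
        self.muls += 1
        return u * v


def folded_p7(a, x, ops):
    """Evaluate equation (11) with each power of x computed once."""
    x2 = ops.mul(x, x)
    x3 = ops.mul(x2, x)
    x5 = ops.mul(x2, x3)  # Fibonacci property: 5 = 2 + 3
    low = ops.add(ops.add(a[0], ops.mul(a[1], x)), ops.mul(a[2], x2))
    mid = ops.mul(ops.add(a[3], ops.mul(a[4], x)), x3)
    high_inner = ops.add(ops.add(a[5], ops.mul(a[6], x)), ops.mul(a[7], x2))
    high = ops.mul(high_inner, x5)
    return ops.add(ops.add(low, mid), high)


coeffs = [3, 1, 4, 1, 5, 9, 2, 6]  # arbitrary illustrative coefficients a_0..a_7
ops = OpCounter()
value = folded_p7(coeffs, 2, ops)
direct = sum(c * 2**i for i, c in enumerate(coeffs))
print(value == direct, ops.adds, ops.muls)  # True 7 10
```

The 7 adds and 10 multiplies together give the 17 arithmetic operations stated above.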


Now let us assume that each common subexpression must be recalculated.
Then the $P_7$ tree becomes the following:

Figure 4.15. $P_7$ Tree Without Common Results


By  counting  the  nodes  in  this  tree,  we  find  that  there  are  21  arithmetic  nodes. 
However,  for  polynomials  with  large  degrees  it  would  be  desirable  to  have  an 
easier  method  for  calculating  the  number  of  nodes  rather  than  having  to  count 
them  up  after  forming  the  trees. 
Therefore, we begin by noting that for a polynomial of degree n
there are n + 1 coefficients and an x to be fetched plus 1 store. Then we
note that $x^n$ may be calculated in $\lfloor \log_2 n \rfloor + N_n - 1$
multiplications, where $N_n$ is the number of 1's in the binary representation
of n, and $\lfloor q \rfloor$ represents the largest integer not greater than
q [4]. From this, we find that to compute a polynomial of degree n, using
Muraoka's folding method, we need exactly $O_n$ arithmetic operations, where
$O_n$ is defined recursively as

(12)  $O_n = O_{n-1} + \lfloor \log_2 k \rfloor + N_k + 1$

where $k = n - \sum F_i$, and the $F_i$ are elements of $F_n$, which is the set
of Fibonacci numbers which are factored out of the $a_n x^n$ term. In other
words, k is the highest power of x remaining inside the parentheses. For
example, in $P_7$, k = 2. By equation 12, $P_7$ requires
$O_7 = 18 + 1 + 1 + 1 = 21$ arithmetic operations, which agrees with our node
count in Figure 4.15.
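
The multiplication count for $x^n$ and the increment in equation (12) can be checked numerically. The sketch below assumes that the count refers to the usual square-and-multiply method; the function names are hypothetical, and only terms with k of at least 1 are covered.

```python
def power_mults(n):
    """Closed form from the text: floor(log2 n) + N_n - 1 multiplications."""
    return (n.bit_length() - 1) + bin(n).count("1") - 1


def count_by_squaring(n):
    """Count the multiplications performed by left-to-right binary powering."""
    mults = 0
    for bit in bin(n)[3:]:  # skip the '0b' prefix and the leading 1 bit
        mults += 1          # square the running power
        if bit == "1":
            mults += 1      # multiply by x
    return mults


def term_increment(k):
    """Increment in equation (12): form x**k, multiply by the coefficient, add."""
    return power_mults(k) + 2


print([power_mults(n) for n in (2, 3, 5, 7)])  # [1, 2, 3, 4]
print(18 + term_increment(2))                   # 21
```

The last line repeats the worked step for $P_7$: with k = 2 the increment is 3, which added to 18 gives 21.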

4.4 Results of the Second Set of Experiments

Returning to the polynomial of degree 30 with unit weighted nodes,
Kraska's upper bound is calculated to be 70 arithmetic operations. These 70
plus 32 fetches and 1 store yield a total of 103 nodes in the graph for
$P_{30}$. In reality, there are 99 nodes. Without using common results, there
are 133 nodes calculated from (12), which agrees with the node count done on
the tree. We formed the graph and the tree of $P_{30}$ as we have done here for
$P_7$, and it is with these that we verified the above results.

Putting these two versions of $P_{30}$ through WSCHED with unweighted nodes
produced the following results. Both had 10 levels of operation. The tree
(133 nodes) required 27 PEs to complete calculation while for the graph (99
nodes), 18 processors sufficed.
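
The figure of 103 nodes quoted for the upper bound on $P_{30}$ can be reproduced from Kraska's expression. This is a sketch under two assumptions: the logarithm is the natural logarithm, as written in the bound, and the non-integer value is rounded down. The function name `kraska_bound` is hypothetical.

```python
import math


def kraska_bound(n, w_add=1, w_mul=1):
    """Upper bound n*w_a + n*w_m + 2.75*ln(n) + 1 on arithmetic operations."""
    return n * w_add + n * w_mul + 2.75 * math.log(n) + 1


n = 30
arith = math.floor(kraska_bound(n))  # assumed rounding: truncate to an integer
fetches = (n + 1) + 1                # n + 1 coefficients plus x
total_nodes = arith + fetches + 1    # plus one store
print(arith, fetches, total_nodes)   # 70 32 103
```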

The  results  bear  out  intuitive  conjectures.  Graphs,  not  requiring 
as  many  operations,  should  need  fewer  processors.  However,  since  the  powers 
of  x  still  must  be  calculated,  our  previous  discussion  stipulates  that  with 
extra  processors  the  time  required  to  perform  one  set  of  calculations  is  equal 
to  the  time  required  to  execute  several  copies  of  the  set  of  calculations  in 
parallel. 

In our discussion of data transmission, we had not mentioned any
time allotted to the routing of information. That is because in our
experiments we did not take it into consideration. It should be very interesting
to see how the added data transmission time would affect our results.

It  would  be  interesting  to  compare  the  amount  of  time  taken  to 
prepare  and  execute  a  FORTRAN  program  using  our  scheduler  (with  optimization 
done  by  back  substitution,  recursion,  and  tree  height  reduction)  with  the 
time  taken  by  the  FORTRAN  code  which  is  optimized  by  hand  plus  the  preparation 
time. 

Duncan  Lawrie  [9]  has  started  such  an  experiment  for  unit  weighted 
nodes.  The  results  are  as  yet  unavailable. 

Another question time did not allow us to investigate is the original
question. Does the amount of time saved by the execution speed-up on our
parallel processor machine, as compared to execution on a single processor,
justify the amount of time we needed to prepare programs for execution on our
system? In other words, does the program spend less total time on our parallel
system than on the single processor?

We see that a program has been written which implements a compile-time
operation scheduler for parallel processors. We have used it to investigate
the distribution process and the tree versus graph question. The results of
the investigations support the theories we have presented.
APPENDIX A

N           Parameter to WSCHED, the number of nodes in the
            graph including fetches and stores.

INCM        Parameter to WSCHED, the connectivity matrix
            representing a graph with N nodes. It is an
            N x N BIT (1) matrix.

NW          Parameter to WSCHED, the weight vector where
            NW (I) is the weight of node I. It is an N vector
            of integers.

NT          Parameter to WSCHED, the type vector where NT (I)
            represents the operator involved at node I. If the
            operation is a fetch, NT (I) specifies which variable
            is to be fetched.

SCHEDPE     External vector with a maximum of 5 elements.
            SCHEDPE (1) = 0 indicates that WSCHED is to find the
            least upper bound on the number of processors needed
            to perform all the operations represented by the graph
            in minimal time. SCHEDPE (1) = k, k a positive
            integer less than 5, indicates the number of straight
            schedulings to be made for the numbers of processors
            specified by SCHEDPE (2) through SCHEDPE (k + 1).

SCRAP       Indicator based on SCHEDPE (1). SCRAP = 1 means to
            calculate the least upper bound, SCRAP = 0 means to
            schedule the graph using the number of PEs specified
            in SCHEDPE (I).

TIME        Saves the timer results so that we know the elapsed
            time from part to part.

STEPZ       A function provided by the system library which
            returns the amount of time remaining before the task's
            time limit is exceeded.

TABLE       A matrix which for the levels of operation in the
            relaxed graph indicates the number of occurrences of
            each of the 6 operators (including fetch and store).

LVL         A vector such that LVL (I) points to the node which is
            the first on level I, the terminal level being level 1.

IPTR        Vector of N pointers, IPTR (I) = J indicates that node I
            in the relaxed graph is node J in the original graph.

BACK, PTR   Work vectors which save pointers during the relaxation
            process.

B           N x N BIT (1) matrix used in the relaxation process.

M           The number of levels of operation in the relaxed graph.

ICWM        The critical path length.

IP          The sum of weights up through the levels.

RATIO       The ratio of IP to ICWM.

RMAX        The maximum ratio.

NP          The greatest lower bound on the number of processors
            needed to compute in minimal time. It is the least
            integer not less than RMAX. It is later the number of
            processors to be scheduled upon in JCOMP.

KODE        Success (0) or failure (1) indicator from JCOMP.

JCOMP       A procedure internal to WSCHED which develops the
            schedule.

N           Parameter to JCOMP, same as N in WSCHED.

KODEF       Parameter to JCOMP, the success-failure indicator.

K           Parameter to JCOMP, the number of processors on which
            to schedule.

LM          Parameter to JCOMP, the number of levels in the graph.

AS          The number of nodes assigned so far.

PAS         The number of nodes previously assigned.

DENY        Indicator of nodes having been denied assignment.

LIST        An N x K matrix the jth column of which is the
            assignment list for the jth processor.

T           T (I) represents the longest path length from node I
            to a root node.

DEPTH       DEPTH (I, 1) is the number of nodes in list I.
            DEPTH (I, 2) is the length of list I calculated from
            the weights of the nodes in list I.

W           An operation's completion time if its node were
            assigned now.

C           The critical path length.

Q           The set of terminal nodes for the graph being considered.


R           Q plus the nodes which would become terminals after
            this time.

B           N x N BIT (1) matrix which is the relaxed connectivity
            matrix workarea.

P           The processors whose lists have minimal length.

LP          The number in P.

NOTN        The cardinality of Q.

NTNA        The cardinality of R.

RESTOR      Matrix saving columns of B.

S           The set of nodes with max $(W_i + T_i)$, i in Q or R.

SP          S ∩ Q.

NS          The cardinality of S.

NSP         The cardinality of SP.

SAVEM       A save matrix.

L           The node under consideration.

PP, LPP, PW, LL, PAS
            Save variables.

ASIGN       The list of nodes which have been assigned thus far.

LU          $\min \{ W_i - NW_i \mid n_i \in (S - SP) \}$.

U           $\{ n_i \mid W_i - NW_i = LU \}$.

X           $\{ i \mid T_i < LU \}$.

NX          Cardinality of X.

NU          Cardinality of U.

LV          LP - NX + NU (0 if S - SP = ∅).

WSJ         $\max \{ NW_i \mid N_i \in SP \}$.

LD, DELTA   Indicate if we can still schedule without exceeding
            the critical path length.

All variables not listed here are used only in the capacity of work areas.
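
The processor bound described above (the ratio of IP to ICWM, its maximum over the levels, and the least integer not less than that maximum) can be sketched as follows. This is an illustration under stated assumptions, not a transcription of WSCHED: the levels are taken in the order the scheduler scans them, starting from the level holding the store, and the names `processor_bound`, `levels`, `weight` and `operands` are hypothetical.

```python
import math


def processor_bound(levels, weight, operands):
    """Least integer not less than max(accumulated work / critical path length).

    levels:   node names grouped by level, in the order the levels are scanned
    weight:   weight[v] is the execution time of node v
    operands: operands[v] lists the nodes whose results v consumes
    """
    scanned = []  # nodes of the levels seen so far
    work = 0      # sum of weights up through the levels
    rmax = 0.0
    for level in levels:
        scanned.extend(level)
        work += sum(weight[v] for v in level)
        in_scope = set(scanned)
        path = {}  # longest weighted path confined to the scanned levels
        for v in reversed(scanned):  # operands are reached before their users
            tail = max((path[s] for s in operands.get(v, ()) if s in in_scope),
                       default=0)
            path[v] = weight[v] + tail
        rmax = max(rmax, work / max(path.values()))
    return math.ceil(rmax)


# F = (A + B) + (C + D) with unit weights, scanned from the store downward.
levels = [["store"], ["add3"], ["add1", "add2"], ["fA", "fB", "fC", "fD"]]
weight = {v: 1 for level in levels for v in level}
operands = {"store": ["add3"], "add3": ["add1", "add2"],
            "add1": ["fA", "fB"], "add2": ["fC", "fD"]}
print(processor_bound(levels, weight, operands))  # 2
```

For this graph the ratios are 1, 1, 4/3 and 2, so the bound is 2 processors.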
APPENDIX B


(SUBSCRIPTRANGE,STRINGRANGE):
TEST: PROC OPTIONS(MAIN);
   DCL SCHEDPE(5) FIXED BIN(15) EXTERNAL;
   GET LIST(L);
B: BEGIN;
      DCL M(L,L) BIT(1), NW(L), NT(L);
      M='0'B;            /* INITIALIZING THE WHOLE MATRIX TO 0 */
      GET DATA(M);
      GET LIST(NW,NT);
      ON ENDFILE(SYSIN) GO TO ZERO;
      GET LIST(SCHEDPE(1));
      GET LIST((SCHEDPE(I) DO I=2 TO SCHEDPE(1)+1));
      GO TO CALLW;
ZERO: SCHEDPE(1)=0;      /* NO FORCED SCHEDULING */
CALLW: CALL WSCHED(L,M,NW,NT);
      PUT SKIP LIST('TEST FINISHED');
   END B;
END TEST;

(SUBSCRIPTRANGE):
WSCHED: PROC(N,INCM,NW,NT);
   DCL SCHEDPE(5) FIXED BIN(15) EXTERNAL;
   DCL TIME FIXED BIN(31), STEPZ ENTRY RETURNS(FIXED BIN(31));
   DCL L,I,J,K,M,N,KK,RATIO,RMAX,NP,ICWM;
   DCL INCM(N,N) BIT(1), NW(N), IPTR(N), NT(N);
   DCL TABLE(30,6) FIXED BIN;
   DCL LVL(0:N);
   DCL SCRAP BIT(1), KODE BIT(1);
   /* SCHEDPE(1) IS THE NUMBER OF SETS OF PE'S TO BE DONE
      SCHEDPE(2) TO SCHEDPE(SCHEDPE(1)+1) CONTAINS
      THE NUMBER OF PE'S OF EACH SET */
   /* SCRAP = 1 INDICATES CALCULATION OF L.U.B. OF
      THE NUMBER OF PROCESSORS NEEDED TO DO THE MATRIX
      IN THE LEAST AMOUNT OF TIME. SCRAP = 0 MEANS TO
      JUST SCHEDULE THE MATRIX USING THE NUMBER OF PE'S
      WHICH IS READ IN AS DATA */
   /* RIGHT NOW, THE LISTS MAY NOT EXCEED 32767 IN LENGTH
      TO CHANGE THIS, ALL THE FIXED BIN(15) VARIABLES WILL
      HAVE TO BE CHANGED TO FIXED BIN(31) AND CERTAIN OF
      THE FORMAT STATEMENTS WILL HAVE TO BE ALTERED ALSO */
   /* LEVEL RECOGNITION: THE ALGORITHM IS NECESSARY TO
      PRODUCE A RELAXED GRAPH (MATRIX) - HOWEVER, IT MUST
      BE NOTED THAT THIS PROCEDURE INVOLVES ORDER OF N
      SQUARED CALCULATIONS AND HENCE VERY SLOW */
   TIME=STEPZ;
   IF SCHEDPE(1)=0 THEN SCRAP='1'B; ELSE SCRAP='0'B;
   DO I=1 TO N;
      PUT SKIP EDIT((INCM(I,J) DO J=1 TO N))((100)B(1));
   END;
   PUT SKIP EDIT((NW(I),NT(I) DO I=1 TO N))((60)F(3));
   DO I=1 TO N;
      IPTR(I)=I;
   END;
   /* LAST LEVEL IS ROWS WITH ALL ZEROS - GETTING LAST LEVEL */
   K,LVL(0)=N+1;
B1: BEGIN;
      DCL (PTR(N),BACK(N)) FIXED BIN;
      DCL MS,MT, B(N,N) BIT(1);
      /* PTR SAVES A LEVEL OF POINTERS, BACK SPECIFIES WHICH NODE
         REFERENCED NODE I */
      PTR,BACK=IPTR;
      DO I=N TO 1 BY -1;
         IF ANY(INCM(I,*))='0'B THEN DO;
            K=K-1;
            IF K¬=IPTR(I) THEN DO;
               MT=IPTR(K);
               MS,IPTR(K)=IPTR(I);
               BACK(MT)=MS;
               IPTR(I)=MT;
               BACK(I)=K;
            END;
         END;
      END;               /* END DO LOOP */

      PTR=IPTR;
      M=1;
      IF K=1 THEN GO TO BLK;
      KK,LVL(1)=K;
      /* LAST LEVEL GOTTEN, PROCEED WITH ALL THE OTHERS */
J0:   DO J=LVL(M-1)-1 TO K BY -1;
         NP=IPTR(J);
         DO I=N TO 1 BY -1;
            MT=BACK(I);
            IF MT<K & INCM(I,NP)='1'B THEN DO;
               DO L=J-1 TO 1 BY -1;
                  IF INCM(I,IPTR(L))='1'B THEN GO TO J2;
               END;
               KK=KK-1;
               IF KK¬=MT THEN DO;
                  MS,PTR(MT)=PTR(KK);
                  PTR(KK)=I;
                  BACK(MS)=MT;
                  BACK(I)=KK;
               END;
            END;
J2:      END;
      END J0;
      IPTR=PTR;          /* SET NEW POINTERS */
      M=M+1;             /* GOT NEXT LEVEL */
      IF K=KK THEN GO TO BLK;
      K,LVL(M)=KK;
J3:   IF K>1 THEN GO TO J0;
BLK:  LVL(M)=1;
      /* GET THE RELAXED MATRIX */
      DO I=1 TO N;
         B(BACK(I),*)=INCM(I,*);
      END;
      DO I=1 TO N;
         INCM(*,BACK(I))=B(*,I);
      END;
      DO I=1 TO N;
         PUT SKIP EDIT((INCM(I,J) DO J=1 TO N))
            ((100)B(1));
      END;
      PUT SKIP DATA(IPTR,BACK);
   END B1;

   /* M LEVELS ESTABLISHED, INDEX IN LVL - LAST LEVEL FIRST
      NOW COMPUTE THE SUMS AND RATIOS OF THE WEIGHTS AND
      THE PATHLENGTHS */
BLOCK: BEGIN;
      DCL IL,JL,LL,IKW,IP,RMAX,LEV,ICW(N);
      TABLE=0;
      ICW=0;
      IP=0;
      ICWM=0;
      RMAX=0;
      DO LEV=M TO 1 BY -1;
         LL=LVL(LEV-1)-1;
         DO IL=LVL(LEV) TO LL;
            K=NT(IPTR(IL));
            IF K<11 THEN DO;
               /* THEN IT IS ONE OF THE FOUR OPERATORS */
               K=K-3;
               TABLE(LEV,K)=TABLE(LEV,K)+1;
            END;
            ELSE IF K=11 THEN TABLE(LEV,5)=TABLE(LEV,5)+1;
               /* IT WAS A STORE INSTRUCTION */
            ELSE TABLE(LEV,6)=TABLE(LEV,6)+1;   /* A FETCH */
            ICW(IL)=NW(IPTR(IL));
            IF ICWM<ICW(IL) THEN ICWM=ICW(IL);
            IP=IP+ICW(IL);
         END;
         DO IL=LVL(LEV)-1 TO 1 BY -1;
            IKW=0;
            DO JL=IL+1 TO LL;
               IF INCM(IL,JL)='1'B THEN
                  IF ICW(JL)>IKW THEN IKW=ICW(JL);
            END;
            IKW=NW(IPTR(IL))+IKW;
            IF IKW>ICW(IL) THEN ICW(IL)=IKW;
            IF ICW(IL)>ICWM THEN ICWM=ICW(IL);
         END;
         RATIO=IP/ICWM;
         PUT SKIP DATA(RATIO);
         IF RMAX<RATIO THEN RMAX=RATIO;
      END;
      /* NP IS THE NUMBER OF PROCESSORS NEEDED TO DO THE MATRIX IN
         THE SHORTEST AMOUNT OF TIME */
      PUT SKIP(2) LIST('THERE ARE',M,'LEVELS');
      PUT SKIP DATA((LVL(I) DO I=1 TO M));
      NP=CEIL(RMAX);
      PUT SKIP EDIT('NUMBER OF PROCESSORS NEEDED FOR ',
         'CONNECTION MATRIX ',NP)(A,A,F(6));
      PUT SKIP(2) LIST('THE TABLE OF TYPES :');
      PUT SKIP EDIT('LEVEL','+','-','*','/','STORE',
         'FETCH')(A,X(7),A,X(10),A,X(9),A,X(10),A,X(8),
         A,X(5),A);
      PUT SKIP(0) EDIT((61)'_')(X(7),A);
      DO I=1 TO M;
         PUT SKIP EDIT(I,' |',(TABLE(I,J) DO J=1 TO 6))
            (F(5),A,(6)F(10));
      END;
   END BLOCK;
   PUT SKIP EDIT('TIME ELAPSED = ',TIME-STEPZ,' CENTISECONDS')
      (X(70),A,F(16),A);
   IF SCRAP='0'B THEN DO I=2 TO SCHEDPE(1)+1;
      NP=SCHEDPE(I);
      CALL JCOMP(N,KODE,NP,M);
   END;
   ELSE DO;
AGAIN: CALL JCOMP(N,KODE,NP,M);
      IF KODE='1'B THEN DO;
         NP=NP+1;
         GO TO AGAIN;
      END;
   END;

JCOMP: PROC(N,KODEF,K,LM);
   PUT SKIP LIST('N & K',N,K);
   /* DECLARATIONS */
   DCL (RESTOR(N,K),B(N,N)) BIT(1), (DI(N),LIST(N,K),DEPTH(K,2),R(N),
        PW(N),LPP,PP(K),ASIGN(N),MN,MJ,
        LU,M,LM,AS,RATIO,NS,NSP,LD,C,LV,NSPC,LWD,REST(K),MS,NU,NX,U(N),X(N),
        T(N),P(K),Q(N),S(N),SP(N),W(N),LP,I,J,K,L,NOTN,NR,BP,
        MI,WSJ,DELTA,NN,KL,NTNA) FIXED BIN;
   DCL MIN FIXED BIN(31),KODEF BIT(1);
   DCL SAVEM(N,N) BIT(1),PAS FIXED BIN,LL,DENY BIT(1);
   DCL LEVEL, MT;
   /*  */
   TIME=STEPZ;
   /* STEP #1 */
   AS=0;         /* THE # OF NODES ASSIGNED SO FAR */
   PAS=0;        /* THE SAVED # OF NODES ASSIGNED-PREVIOUS AS */
   DENY='0'B;    /* 1 IF A NODE HAS BEEN DENIED ASSIGNMENT */
   PP=0; LPP=0; PW=0;   /* THESE ARE THE SAVE VARIABLES */
   LIST=0;
   /* ROUTINE TO CALCULATE T(I), THE LONGEST PATH BETWEEN
      NODE I AND ITS SET OF STARTING NODES */
   /* THE D1 ROUTINE CAN BE MOVED TO AN OUTER PROCEDURE LATER */
   T=0;
   PUT SKIP LIST('STARTING DOWN GRAPH');
   /* STARTING DOWN THE GRAPH */
   /* LM IS THE NUMBER OF LEVELS */
D1: DO LEVEL=LM-1 TO 1 BY -1;
      MS=LVL(LEVEL);
      MT=LVL(LEVEL-1)-1;
      DO J=MS TO MT;
         DO I=1 TO MS-1;
            IF INCM(I,J)='1'B THEN DO;
               MN=NW(IPTR(I))+T(I);
               IF MN>T(J) THEN T(J)=MN;
            END;
   END D1;
   PUT SKIP DATA(T);
   /* DEPTH(I,1)=# OF ENTRIES OF STACK I */
   /* DEPTH(I,2)= DEPTH OF STACK I */
   PUT SKIP LIST('ACTUAL CRITICAL PATH LENGTH',ICWM);
   C=ICWM;       /* CRITICAL PATH LENGTH */
   DEPTH=0;
   W=0;
   KODEF='0'B;
   DI=0;
   B=INCM;       /* B IS THE WORK CONNECTIVITY MATRIX */
   NR=0;         /* # OF COLUMNS IN RESTOR */
   DO I=LVL(1) TO N;
      W(I)=NW(IPTR(I));
   END;

   /*  */
   /* STEP #2 */
S2: MIN=99999999;
   DO I=1 TO K;
      IF MIN>DEPTH(I,2) THEN DO;
         LP=1;
         P(1)=I;
         MIN=DEPTH(I,2);
         GO TO S21;
      END;
      IF MIN=DEPTH(I,2) THEN DO;
         LP=LP+1;
         P(LP)=I;
      END;
S21: END;
   DO I=1 TO LP;
      IF DEPTH(P(I),2)¬=0 THEN DO;
         L=LIST(DEPTH(P(I),1),P(I));
         IF L<0 THEN GO TO S22;    /* SKIP OVER DUMMY VARIABLES */
         B(*,L)='0'B;              /* ELIMINATE THE COLUMN FROM B */
      END;
S22: END;
   /*  */
   /* STEP #3 */
   NOTN=0;       /* # OF TERMINALS, I.E. |Q| */
   DO I=1 TO N;
      /* HAS THE NODE BEEN ASSIGNED ALREADY? */
      DO J=1 TO AS;
         IF I=ASIGN(J) THEN GO TO S3;
      END;
      IF ANY(B(I,*))='0'B THEN DO;
         NOTN=NOTN+1;
         Q(NOTN)=I;
      END;
S3: END;
   IF NOTN=0 THEN DO;
      MIN=99999999;
      DO I=LP+1 TO K;
         IF MIN>DEPTH(I,2) THEN MIN=DEPTH(I,2);
      END;
      DO I=1 TO LP;
         KL=DEPTH(I,2)-MIN;
         IF KL=0 THEN GO TO S31;
         MN,DEPTH(I,1)=DEPTH(I,1)+1;
         DEPTH(I,2)=MIN;
         /* SETTING UP THE DUMMY BLOCKS,- THE DEPTH OF THE X BLOCK */
         LIST(MN,I)=KL;
S31:  END;
      GO TO S2;                    /* RESTORATION */
   END;
   NTNA=0;
   /*  */
   /* STEPS #4 & 5 */
   J=1;
S4: DO I=1 TO K;
      IF I=P(J) THEN DO;
         IF J<LP THEN J=J+1;
         GO TO ENDS4;
      END;
      IF (DEPTH(I,2)¬=0) THEN DO;
         L=LIST(DEPTH(I,1),I);
         IF L<0 THEN GO TO ENDS4;
         NR=NR+1;
         RESTOR(*,NR)=B(*,L);
         REST(NR)=L;
         B(*,L)='0'B;
      END;
ENDS4: END S4;
   DO I=1 TO N;                    /* GET THE REST OF R */
      DO J=1 TO AS;
         IF I=ASIGN(J) THEN GO TO S5;
      END;
      IF ANY(B(I,*))='0'B THEN DO;
         NTNA=NTNA+1;
         R(NTNA)=I;
      END;
S5: END;
   DO I=1 TO NOTN;
      W(Q(I))=NW(IPTR(Q(I)))+DEPTH(P(L),2);
   END;
   /* WANT SUCCESSORS OF NCRUM, ROW R(I) TELLS */
   IF NOTN-NTNA=0 THEN GO TO S06;
   DO I=1 TO NTNA;
      DO J=1 TO NOTN;              /* FOR NODES IN R-Q */
         IF R(I)=Q(J) THEN GO TO S51;
      END;
      MN=0;
      DO J=R(I)+1 TO N;
         IF INCM(R(I),J)='1'B THEN IF W(J)>MN THEN MN=W(J);
         /* GET THE PREDECESSOR FROM THE ORIGINAL GRAPH */
      END;
      W(R(I))=NW(IPTR(R(I)))+MN;
S51: END;
   /*  */
   /* STEP #6 */
S06: CALL SUB6(M,R,NTNA,S,NS);
   NSP=0;
   DO I=1 TO NS;
      DO J=1 TO NOTN;
         IF S(I)=Q(J) THEN DO;
            NSP=NSP+1;
            SP(NSP)=S(I);
            GO TO S6;
         END;
      END;
S6: END;
   IF M<C THEN M=C;
   LD=M-C;
   IF SCRAP='1'B & LD>0 THEN DO;
      KODEF='1'B;                  /* FAILURE SIGNALED */
      PUT SKIP EDIT('TIME ELAPSED = ',TIME-STEPZ,' CENTISECONDS')
         (X(70),A,F(16),A);
      RETURN;
   END;
   IF LD>0 THEN IF DENY='1'B THEN DO;
      DO J=1 TO K;
         DO I=PAS+1 TO AS;
            IF LIST(DEPTH(J,1),J)=ASIGN(I) THEN DO;
               LIST(DEPTH(J,1),J)=0;       /* REMOVE NODE FROM LIST */
               DEPTH(J,1)=DEPTH(J,1)-1;
               IF DEPTH(J,1)<0 THEN DEPTH(J,1)=0;   /* PRECAUTION ONLY */
               DEPTH(J,2)=DEPTH(J,2)-NW(IPTR(ASIGN(I)));
               GO TO S61;
            END;
         END;
         /* REMOVE DUMMY BLOCKS IF THEY EXIST ON THE TOP OF THE LISTS */
S61:     IF LIST(DEPTH(J,1),J)<0 THEN DO;
            DEPTH(J,2)=DEPTH(J,2)+LIST(DEPTH(J,1),J);
            LIST(DEPTH(J,1),J)=0;
            DEPTH(J,1)=DEPTH(J,1)-1;
         END;
      END;
      AS=PAS;       /* SET ASSIGNMENT COUNTER AT PREVIOUS */
      B=SAVEM;      /* RESET MATRIX */
      P=PP;         /* RESET P VECTOR */
      LP=LPP;       /* RESET P COUNTER */
      W=PW;         /* RESET W VECTOR */
      DENY='0'B;    /* RESET DENY BIT TO OFF */
      L=LL;         /* SET NODE */
      /* PUSH NODE INTO THE STACK */
      NN,DEPTH(P(LP),1)=DEPTH(P(LP),1)+1;
      LIST(NN,P(LP))=L;
      PUT SKIP EDIT('STACK IN 6-8',P(LP),'#',L)(A,F(5),X(2));
      DEPTH(P(LP),2)=W(L);
      PUT SKIP LIST('STACK',P(LP),'# IN 6-8',L);
      AS=AS+1;
      ASIGN(AS)=L;
      LP=LP-1;
      B(L,*)='0'B;
      GO TO S9;
   END;
   /*  */
   /* STEP #7 */
   C=M;
   IF NS-NSP=0 THEN LV=0;
   ELSE DO;
      NSPC=1;
      LU=99999999;
      DO I=1 TO NS;                /* GET LU BY FINDING S-S' */
         IF NSPC>NSP THEN GO TO S70;
         IF S(I)=SP(NSPC) THEN DO;
            NSPC=NSPC+1;
            GO TO S700;
         END;
S70:     LWD=W(S(I))-NW(IPTR(S(I)));
         IF LWD<LU THEN LU=LWD;
S700: END;
S7:   NU=0;         /* |U| */
      NX=0;         /* |X| */
      DO I=1 TO N;                 /* COMPUTE LU,U,X */
         LWD=W(I)-NW(IPTR(I));
         IF (LWD¬=LU) THEN GO TO ST70;
         NU=NU+1;
         U(NU)=I;
ST70: END;
      DO I=1 TO K;
         IF DEPTH(I,2)>LU THEN GO TO ST71;
         NX=NX+1;
         X(NX)=I;
ST71: END;
      LV=LP-NX+NU;
      IF LV<0 THEN LV=0;
   END;
   IF LP<=LV|NSP=0 THEN GO TO ST8;
   DO WHILE(LP>LV&NSP>0);
      /* GET MAX(W(I)|N(I) IS IN S') */
      MJ=0;
      DO I=1 TO NSP;
         MN=NW(IPTR(SP(I)));
         IF MJ<MN THEN DO;
            MJ=MN;
            MI=I;
         END;
      END;
      /* PUSH N(MI) INTO THE LIST */
      MN,DEPTH(P(LP),1)=DEPTH(P(LP),1)+1;
      LIST(MN,P(LP))=SP(MI);
      PUT EDIT('STACK',P(LP),'#',SP(MI))
         (X(5),A,F(5),X(2),A,F(5));
      /* ELIMINATE THE SP(MI) ROW FROM B */
      /* G' = G' - N(SP(MI)) */
      AS=AS+1;
      ASIGN(AS)=SP(MI);
      MN=0;
      W(SP(MI)),DEPTH(P(LP),2)=DEPTH(P(LP),2)+NW(IPTR(SP(MI)));
      DO I=1 TO NSP;               /* S' = S' - N(MI) */
         IF I=MI THEN GO TO ST72;
         MN=MN+1;
         SP(MN)=SP(I);
ST72: END;
      LP=LP-1;
      NSP=NSP-1;
   END;
   GO TO S9;
/*  */ 

/*       STEP    #8      */ 

ST8:  CALL SUB6(M,Q,NCTN,SP,NSP);
BP=LP;
PUT SKIP LIST('S'' ',NSP);
PUT LIST((SP(I) DO I=1 TO NSP));
/* FIND WSJ=MAX(NW(I)|N(I) IS IN S') */
WSJ=0;
DO I=1 TO NSP;
MN=NW(IPTR(SP(I)));
IF WSJ<MN THEN WSJ=MN;
END;
MN=NSP;
MT=0;
ST80:  DO J=1 TO MN WHILE (LP>0 & NSP>0);
L=SP(J);
IF NW(IPTR(L))=WSJ THEN DO;

SP(J)=0;
MT=1;
NSP=NSP-1;
DELTA=W(L)-LU;
END;
IF DELTA<=0 | LP-LV>0 THEN  /* PUSH INTO STACK */
DO;
NN,DEPTH(P(LP),1)=DEPTH(P(LP),1)+1;
LIST(NN,P(LP))=L;
PUT SKIP LIST('STEP 8',L);
DEPTH(P(LP),2)=W(L);
PUT SKIP LIST('STACK',P(LP),' # IN 8',L);
AS=AS+1;
ASIGN(AS)=L;
LP=LP-1;
B(L,*)='0'B;
END;
ELSE IF DELTA<=DI(L)|DI(L)=0 THEN DO;
DO I=1 TO NR;
B(*,REST(I))=RESTOR(*,I);
END;


NR=0; 

DI(L)=DELTA;
PP=P;  /* SAVE P VECTOR */
LPP=LP;  /* SAVE P COUNTER */
PW=W;  /* SAVE W VECTOR */
SAVEM=B;  /* SAVE MATRIX */
PAS=AS;  /* SAVE ASSIGNED COUNTER */
LL=L;  /* SAVE NODE */
DENY='1'B;  /* SET DENY BIT */

END; 

END    ST80; 

/* COMPRESS S' */
IF MT=1 | NSP¬=0
THEN DO;
J=0;
DO I=1 TO MN;
IF SP(I)¬=L
THEN DO;
J=J+1;
SP(J)=SP(I);
END;
END;
END;

IF LP=BP THEN DO;

MIN=99999999;
I=1;
DO MN=1 TO K;
IF MN=P(I)&I<=LP THEN DO;
I=I+1;
GO TO ST805;
END;
IF MIN>DEPTH(I,2) THEN MIN=DEPTH(I,2);
ST805:  END;
DO I=1 TO LP;
KL=DEPTH(P(I),2)-MIN;
IF KL=0 THEN GO TO ST806;
MN,DEPTH(P(I),1)=DEPTH(P(I),1)+1;
LIST(MN,P(I))=KL;
DEPTH(P(I),2)=MN;
ST806:  END;
END;

/*  */ 

/*       STEP    #9      */ 

S9:  IF ANY(B)='0'B THEN GO TO S10;
S95:  DO I=1 TO NR;
B(*,REST(I))=RESTOR(*,I);  /* RESTORE COLUMNS */
END;
NR=0;  /* HAVING RESTORED, SET COUNTER TO ZERO */
GO TO S2;
/*  */
/*       STEP    #10     */
S10:  L=0;
DO J=1 TO N;
DO I=1 TO AS;
IF ASIGN(I)=J THEN GO TO S101;
END;
L=L+NW(IPTR(J));
S101:  END;

NN=0;
MS=0;
DO I=1 TO K;
MN=DEPTH(I,2);
IF MN>MS THEN MS=MN;
NN=NN+MN;
END;
NN=K*MS-NN;
IF NN<L | AS<N THEN GO TO S95;
NU=0;
DO I=1 TO K;
IF NU<DEPTH(I,2) THEN NU=DEPTH(I,2);
END;
PUT SKIP(2) EDIT('THE PROCESSORS ARE SCHEDULED FOR ',NU,
' TIME UNITS')(A,F(8),A);
/* SET UP THE ORIGINAL NODE NUMBERING INTO LISTS */
S102:  DO J=1 TO K;
L=DEPTH(J,1);
MN=0;
DO I=1 TO L;
IF LIST(I,J)<0 THEN DO;
MN=MN-LIST(I,J);
GO TO S103;
END;
LIST(I,J)=IPTR(LIST(I,J));
S103:  END;
PUT SKIP(2) EDIT('LIST',J,':')(A,F(5),A);
PUT SKIP LIST((LIST(I,J) DO I=1 TO L));
PUT SKIP EDIT('LENGTH OF LIST = ',DEPTH(J,2))(A,F(6));
IF DEPTH(J,2)<NU THEN MN=NU-DEPTH(J,2)+MN;
RATIO=MN/NU*100;
PUT SKIP EDIT('% OF IDLE TIME IN THIS LIST = ',RATIO)
(A,F(10,6));
END S102;
KODEF='0'B;  /* SUCCESS */
PUT SKIP EDIT('TIME ELAPSED = ',TIME-STEPZ,' CENTISECONDS')
(X(70),A,F(16),A);
RETURN; 
SUB6:  PROC(N,R,NTNA,S,NS);
DCL R(N) FIXED BIN,S(N) FIXED BIN;
DCL I;
M=0;
NS=0;
DO I=1 TO NTNA;
MN=W(R(I))+T(R(I));
IF M<MN THEN M=MN;
END;
DO I=1 TO NTNA;
IF T(R(I))+W(R(I))=M THEN DO;
NS=NS+1;
S(NS)=R(I);
END SUB6;
END JCOMP;
END    WSCHED; 
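
As a reading aid, the selection made by SUB6 can be sketched in Python, which is not the language of the listing. SUB6 makes two passes over the candidate nodes R(1..NTNA): the first finds the largest value of W + T, and the second collects every candidate that attains it. The function name, the dictionary representation of W and T, and the sample values below are illustrative assumptions, not part of the original program. The sketch returns the maximum explicitly, whereas the listing assigns it to M and hands back the selection through S and NS.

```python
def select_max_candidates(candidates, w, t):
    """Return (m, selected) in the manner of SUB6.

    m is the largest w[r] + t[r] over the candidates (0 if there are none),
    and selected lists, in input order, every candidate attaining m.
    """
    m = 0  # SUB6 also starts its running maximum at zero
    for r in candidates:  # first pass: find the maximum of W + T
        m = max(m, w[r] + t[r])
    # second pass: keep every candidate whose W + T equals the maximum
    selected = [r for r in candidates if w[r] + t[r] == m]
    return m, selected


# Hypothetical data: node number -> value
w = {1: 3, 2: 5, 3: 1, 4: 4}
t = {1: 2, 2: 1, 3: 5, 4: 1}
m, s = select_max_candidates([1, 2, 3, 4], w, t)
print(m, s)  # 6 [2, 3]
```

Nodes 2 and 3 tie at 6, so both are returned; ties are kept rather than broken, as in the listing.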